MINIMUM BAYES ERROR FEATURE SELECTION IN SPEECH RECOGNITION

Field of the Invention 

The present invention relates to speech recognition and to methods and apparatus 
for facilitating the same. 

Background of the Invention 

Modern speech recognition systems use cepstral features characterizing the 
short-term spectrum of the speech signal for classifying frames into phonetic classes. 
Cepstral features are features that are typically obtained through an orthogonal 
transformation (such as a discrete cosine transform) of short-term spectral features. 
These cepstral features are augmented with dynamic information from the adjacent 
frames to capture transient spectral events in the signal. What is commonly referred to as 
MFCC+Δ+ΔΔ features include "static" mel-frequency cepstral coefficients (usually 13)
plus their first and second order derivatives computed over a sliding window of typically 9 
consecutive frames yielding 39-dimensional feature vectors every 10ms. One major 
drawback of this front-end scheme is that the same computation is performed regardless 
of the application, channel conditions, speaker variability, etc. In recent years, an 
alternative feature extraction procedure based on discriminant techniques has emerged,
wherein the consecutive cepstral frames are spliced together forming a supervector which 
is then projected down to a manageable dimension. One of the better known objective 
functions for designing the feature space projection is linear discriminant analysis (LDA). 

LDA, as discussed in Duda et al., "Pattern classification and scene analysis" 
(Wiley, New York, 1973) and Fukunaga, "Introduction to statistical pattern recognition"
(Academic Press, New York, 1973), is a standard technique in statistical pattern 
classification for dimensionality reduction with a minimal loss in discrimination. Its 
application to speech recognition has shown consistent gains for small vocabulary tasks 
and mixed results for large vocabulary applications (see Haeb-Umbach et al., "Linear
Discriminant Analysis for improved large vocabulary continuous speech recognition", 
Proceedings of ICASSP '92, and Kumar et al., "Heteroscedastic discriminant analysis and 
reduced rank HMM's (Hidden Markov Models) for improved speech recognition", 
Speech Communication, 26:283-297, 1998). Recently, there has been an interest in 
extending LDA to heteroscedastic discriminant analysis (HDA) by incorporating the 
individual class covariances in the objective function (see Kumar et al., supra, and Saon et
al., "Maximum likelihood discriminant feature spaces", Proceedings of ICASSP 2000,
Istanbul, 2000). Indeed, the equal class covariance assumption made by LDA does not
always hold true in practice, making the LDA solution highly suboptimal for specific cases
(see Saon et al., supra).

However, since both LDA and HDA are heuristics, they do not guarantee an 
optimal projection in the sense of a minimum Bayes classification error (i.e., a minimum 
probability of misclassification). A need has thus been recognized in connection with 
selecting features on the basis of a minimum probability of misclassification. 

Summary of the Invention 

In view of the foregoing, the present invention, in accordance with at least one 
presently preferred embodiment, broadly contemplates employing feature space 
projections according to objective functions which are more intimately linked to the 
probability of misclassification. More specifically, the probability of misclassification in
the original space, $\varepsilon$, will be defined, as well as in the projected space, $\varepsilon_\theta$, while conditions
will be given under which $\varepsilon_\theta = \varepsilon$. Since discrimination information is usually lost after a
projection $y = \theta x$, the Bayes error in the projected space can never decrease, that
is, $\varepsilon_\theta \ge \varepsilon$. Therefore, minimizing $\varepsilon_\theta$ amounts to finding $\theta$ for which the equality case holds.

An alternative approach is to define an upper bound on $\varepsilon_\theta$ and to directly minimize
this bound.



YOR920000388US1 



In summary, one aspect of the present invention provides a method of providing 
pattern recognition, the method comprising the steps of: inputting a pattern; transforming 
the input pattern to provide a set of at least one feature for a classifier; the transforming 
step comprising the step of minimizing the probability of subsequent misclassification of 
the at least one feature in the classifier; the minimizing step comprising: developing an 
objective function; and optimizing the objective function through gradient descent. 

Another aspect of the invention provides apparatus for providing pattern 
recognition, the apparatus comprising: an input interface for inputting a pattern; a 
transformer for transforming the input pattern to provide a set of at least one feature for a 
classifier; the transformer being adapted to minimize the probability of subsequent 
misclassification of the at least one feature in the classifier; the transformer further being 
adapted to: develop an objective function; and optimize the objective function through 
gradient descent. 

Furthermore, an additional aspect of the present invention provides a program 
storage device readable by machine, tangibly embodying a program of instructions 
executable by the machine to perform method steps for providing pattern recognition, the 
method comprising the steps of: inputting a pattern; transforming the input pattern to 
provide a set of at least one feature for a classifier; the transforming step comprising the 
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step of minimizing the probability of subsequent misclassification of the at least one 
feature in the classifier; the minimizing step comprising: developing an objective function; 
and optimizing the objective function through gradient descent. 

For a better understanding of the present invention, together with other and further 
features and advantages thereof, reference is made to the following description, taken in 
conjunction with the accompanying drawings, and the scope of the invention will be 
pointed out in the appended claims. 

Brief Description of the Drawings 

Figure 1 schematically illustrates a general pattern recognition arrangement. 

Figure 2 schematically sets forth a method of minimum Bayes error feature 
selection. 

Figure 3 illustrates the evolution of objective functions for divergence.

Figure 4 illustrates the evolution of objective functions for the Bhattacharyya
bound.

Fig. 1 illustrates a general arrangement 100, such as a speech recognition
arrangement, in which an input pattern 102, such as a spoken utterance, enters a feature
extractor 104, from which features 106 will progress to a classifier 108. The output 110
of classifier 108 will go into a post-processor 112, from which the final output 114
emerges. The makeup and function of a feature extractor, classifier and post-processor
are generally well-known to those of ordinary skill in the art. Duda et al., supra, provides
a good background discussion of these and other general concepts that may be employed 
in accordance with at least one presently preferred embodiment of the present invention. 

Towards extracting features from extractor 104, the present invention broadly 
contemplates the use of minimum Bayes error feature selection, indicated schematically at 
117, and as will be elucidated upon herebelow. 

Reference is made immediately herebelow and throughout to Figure 2, which 
schematically illustrates a method for providing minimum Bayes error feature selection. 

With regard to Bayes error, one may first consider the general problem of
classifying an $n$-dimensional vector $x$ (input 102) into one of $C$ distinct classes. Records
(104) are input and a full-covariance gaussian clustering of the records is undertaken for
every class (122). By way of means, covariances and priors (124), an objective function is
formed (126), and the objective function is preferably optimized through gradient descent
(130). If the optimization converges (132), then all of the records $x$ are transformed into
$y = \theta x$, and the resulting output (106) represents the final features for the classifier 108
(see Fig. 1).
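
The statistics gathered at step 124 can be illustrated with a short sketch. It assumes NumPy, labelled records, and a single full-covariance gaussian per class; the function name `class_statistics` and the synthetic data are hypothetical and are not taken from the disclosure.

```python
import numpy as np

def class_statistics(records, labels):
    """Estimate (prior, mean, full covariance) for every class label.

    Assumption: one gaussian per class, fitted by maximum likelihood.
    """
    records = np.asarray(records, dtype=float)
    labels = np.asarray(labels)
    stats = {}
    for c in np.unique(labels):
        x_c = records[labels == c]
        prior = x_c.shape[0] / records.shape[0]      # relative class frequency
        mean = x_c.mean(axis=0)
        cov = np.cov(x_c, rowvar=False, bias=True)   # biased (ML) full covariance
        stats[int(c)] = (prior, mean, cov)
    return stats

# Synthetic two-class example in three dimensions (illustrative data only)
rng = np.random.default_rng(0)
records = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)),
                     rng.normal(2.0, 0.5, size=(100, 3))])
labels = np.array([0] * 200 + [1] * 100)
stats = class_statistics(records, labels)
```

The resulting priors, means and covariances are the only quantities that the objective functions discussed herebelow require.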

This portion of the disclosure first addresses the Bayes error rate and its link to the 
divergence and the Bhattacharyya bound, as well as general considerations relating to 
minimum Bayes error feature selection. 

Let each class $i$ be characterized by its own "prior" (i.e., prior probability) $\lambda_i$ and
probability density function $p_i$, $i = 1, \ldots, C$. Assume that $x$ is classified as belonging to
class $j$ through the Bayes assignment:

$$j = \arg\max_{1 \le i \le C} \lambda_i p_i(x).$$

The expected error for this classifier is called Bayes error (see Fukunaga, supra), or
probability of misclassification, and is defined as

$$\varepsilon = 1 - \int_{\mathbb{R}^n} \max_{1 \le i \le C} \lambda_i p_i(x)\,dx. \tag{1}$$

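
As a numerical illustration of equation (1), the following sketch evaluates the Bayes error of two one-dimensional gaussian classes on a grid. The class parameters, the grid limits and the function names are assumptions chosen for the example; for two unit-variance classes with equal priors and means 0 and 2, the result can be compared against the known value $\Phi(-1)$ of the standard normal distribution function.

```python
import math
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def bayes_error_1d(priors, means, stds, lo=-20.0, hi=20.0, num=200001):
    """Equation (1) in one dimension, approximated by a Riemann sum."""
    x = np.linspace(lo, hi, num)
    dx = x[1] - x[0]
    weighted = np.array([l * gaussian_pdf(x, m, s)
                         for l, m, s in zip(priors, means, stds)])
    # The Bayes assignment keeps, at each x, the class with the largest lambda_i * p_i(x)
    return 1.0 - np.sum(weighted.max(axis=0)) * dx

eps = bayes_error_1d([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
# Reference value: the decision boundary is x = 1, so the error is Phi(-1)
expected = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))
```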
Suppose next that the linear transformation $f: \mathbb{R}^n \to \mathbb{R}^p$, $y = f(x) = \theta x$, is
performed, with $\theta$ being a $p \times n$ matrix of rank $p \le n$. Moreover, one may denote by $p_i^\theta$
the transformed density for class $i$. The Bayes error in the range of $\theta$ now becomes

$$\varepsilon_\theta = 1 - \int_{\mathbb{R}^p} \max_{1 \le i \le C} \lambda_i p_i^\theta(y)\,dy. \tag{2}$$

Since the transformation $y = \theta x$ produces a vector whose coefficients are linear
combinations of the input vector $x$, it can be shown (see Decell et al., "An iterative
approach to the feature selection problem", Proc. Purdue Univ. Conf. on Machine
Processing of Remotely Sensed Data, 3B1-3B12, 1972) that, in general, information is
lost and $\varepsilon_\theta \ge \varepsilon$.

For a fixed $p$, the feature selection problem can be stated as finding $\hat{\theta}$ such that

$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^{p \times n},\ \mathrm{rank}(\theta) = p} \varepsilon_\theta. \tag{3}$$

However, an indirect approach to equation (3) is now contemplated: by maximizing the
average pairwise divergence and relating it to $\varepsilon_\theta$, and by minimizing the union
Bhattacharyya bound on $\varepsilon_\theta$.

In Kullback, "Information theory and statistics" (Wiley, New York, 1968), the
symmetric divergence between class $i$ and $j$ is given by

$$D(i,j) = \int_{\mathbb{R}^n} \left[ p_i(x) \log\frac{p_i(x)}{p_j(x)} + p_j(x) \log\frac{p_j(x)}{p_i(x)} \right] dx. \tag{4}$$

$D(i,j)$ represents a measure of the degree of difficulty of discriminating between
the classes (the larger the divergence, the greater the separability between the classes).
Similarly, one can define $D_\theta(i,j)$, the pairwise divergence in the range of $\theta$. Kullback,
supra, showed that $D_\theta(i,j) \le D(i,j)$. If the equality case holds, then $\theta$ is called a "sufficient
statistic for discrimination." The average pairwise divergence is defined as

$$D = \frac{2}{C(C-1)} \sum_{1 \le i < j \le C} D(i,j) \quad \text{and respectively} \quad D_\theta = \frac{2}{C(C-1)} \sum_{1 \le i < j \le C} D_\theta(i,j).$$

It follows that $D_\theta \le D$.

The following theorem, from Decell et al., supra, provides a link between Bayes
error and divergence for classes with uniform priors $\lambda_1 = \ldots = \lambda_C\ (= 1/C)$:

Theorem: If $D_\theta = D$ then $\varepsilon_\theta = \varepsilon$.



The main idea of the proof of the above theorem is to show that if the divergences
are the same then the Bayes assignment is preserved because the likelihood ratios are
preserved almost everywhere:

$$\frac{p_i(x)}{p_j(x)} = \frac{p_i^\theta(\theta x)}{p_j^\theta(\theta x)}, \quad i \ne j.$$

The result follows by noting that for any measurable set $A \subset \mathbb{R}^p$

$$\int_A p_i^\theta(y)\,dy = \int_{\theta^{-1}(A)} p_i(x)\,dx, \tag{5}$$

where $\theta^{-1}(A) = \{x \in \mathbb{R}^n \mid \theta x \in A\}$. The previous theorem provides a basis for selecting $\theta$
such as to maximize $D_\theta$.



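The assumption may now be made that each class $i$ is normally distributed with
mean $\mu_i$ and covariance $\Sigma_i$, that is, $p_i(x) = N(x; \mu_i, \Sigma_i)$ and
$p_i^\theta(y) = N(y; \theta\mu_i, \theta\Sigma_i\theta^T)$, $i = 1, \ldots, C$. It is straightforward to show that, in this case,
the divergence is given by

$$D(i,j) = \frac{1}{2}\,\mathrm{trace}\left\{ \Sigma_i^{-1}\left[\Sigma_j + (\mu_i - \mu_j)(\mu_i - \mu_j)^T\right] + \Sigma_j^{-1}\left[\Sigma_i + (\mu_i - \mu_j)(\mu_i - \mu_j)^T\right] \right\} - n. \tag{6}$$

Thus, the objective function to be maximized becomes

$$D_\theta = \frac{1}{C(C-1)}\,\mathrm{trace}\left\{ \sum_{i=1}^{C} (\theta\Sigma_i\theta^T)^{-1}\,\theta S_i \theta^T \right\} - p, \tag{7}$$

where $S_i = \sum_{j \ne i} \left[\Sigma_j + (\mu_i - \mu_j)(\mu_i - \mu_j)^T\right]$, $i = 1, \ldots, C$.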
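
The relationship between the pairwise divergence of equation (6) and the averaged objective of equation (7) can be checked numerically. The sketch below assumes NumPy and randomly generated gaussian classes; the function names and the test data are hypothetical. With $\theta$ equal to the identity matrix, equation (7) should reproduce the average of equation (6) over all class pairs, and a rank-deficient projection should not increase the divergence.

```python
import numpy as np

def pairwise_divergence(mu_i, cov_i, mu_j, cov_j):
    """Symmetric divergence between two gaussians, equation (6)."""
    d = (mu_i - mu_j)[:, None]
    scatter = d @ d.T
    n = mu_i.shape[0]
    term = (np.linalg.solve(cov_i, cov_j + scatter)
            + np.linalg.solve(cov_j, cov_i + scatter))
    return 0.5 * np.trace(term) - n

def average_divergence(theta, means, covs):
    """Average pairwise divergence in the range of theta, equation (7)."""
    C, p = len(means), theta.shape[0]
    total = 0.0
    for i in range(C):
        # S_i accumulates Sigma_j + (mu_i - mu_j)(mu_i - mu_j)^T over all j != i
        S_i = np.zeros_like(covs[i])
        for j in range(C):
            if j != i:
                d = (means[i] - means[j])[:, None]
                S_i += covs[j] + d @ d.T
        projected_cov = theta @ covs[i] @ theta.T
        total += np.trace(np.linalg.solve(projected_cov, theta @ S_i @ theta.T))
    return total / (C * (C - 1)) - p

# Random gaussian classes (illustrative data only)
rng = np.random.default_rng(0)
n, p, C = 4, 2, 3
means = [rng.normal(size=n) for _ in range(C)]
covs = []
for _ in range(C):
    a = rng.normal(size=(n, n))
    covs.append(a @ a.T + n * np.eye(n))   # symmetric positive definite
theta = rng.normal(size=(p, n))

D_full = average_divergence(np.eye(n), means, covs)
D_proj = average_divergence(theta, means, covs)
```

Here `D_proj` is not larger than `D_full`, in line with the inequality $D_\theta \le D$ stated above.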



Following matrix differentiation results from Searle, "Matrix algebra useful for
statistics" (Wiley Series in Probability and Mathematical Statistics, New York, 1982), the
gradient of $D_\theta$ with respect to $\theta$ (indicated at 128 in Fig. 2) has the expression

$$\frac{\partial D_\theta}{\partial \theta} = \frac{2}{C(C-1)} \sum_{i=1}^{C} (\theta\Sigma_i\theta^T)^{-1}\left[\theta S_i - \theta S_i \theta^T (\theta\Sigma_i\theta^T)^{-1}\,\theta\Sigma_i\right]. \tag{8}$$

The use of equation (8) is indicated in Fig. 2 at 130.

Unfortunately, it turns out that $\partial D_\theta / \partial\theta = 0$ has no analytical solutions for the
stationary points. Instead, one has to use numerical optimization routines for the
maximization of $D_\theta$.
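
A minimal sketch of such a numerical procedure follows. It assumes NumPy, evaluates equation (7) together with the gradient of equation (8), verifies the gradient against central finite differences, and then performs a plain gradient ascent with step halving. The ascent loop is a simple stand-in chosen for illustration and is not the optimization routine used in the experiments reported herebelow; all function names and data are hypothetical.

```python
import numpy as np

def scatter_sums(means, covs):
    """S_i = sum over j != i of [Sigma_j + (mu_i - mu_j)(mu_i - mu_j)^T]."""
    S = []
    for i in range(len(means)):
        S_i = np.zeros_like(covs[i])
        for j in range(len(means)):
            if j != i:
                d = (means[i] - means[j])[:, None]
                S_i += covs[j] + d @ d.T
        S.append(S_i)
    return S

def divergence_and_gradient(theta, covs, S):
    """Value of equation (7) and its gradient, equation (8)."""
    C, p = len(covs), theta.shape[0]
    value, grad = 0.0, np.zeros_like(theta)
    for cov_i, S_i in zip(covs, S):
        A = theta @ cov_i @ theta.T
        A_inv_tS = np.linalg.solve(A, theta @ S_i)   # A^{-1} theta S_i
        M = A_inv_tS @ theta.T                       # A^{-1} theta S_i theta^T
        value += np.trace(M)
        grad += A_inv_tS - M @ np.linalg.solve(A, theta @ cov_i)
    scale = 1.0 / (C * (C - 1))
    return scale * value - p, 2.0 * scale * grad

def numerical_gradient(f, theta, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        e = np.zeros_like(theta)
        e[idx] = h
        g[idx] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(1)
n, p, C = 5, 2, 3
means = [rng.normal(size=n) for _ in range(C)]
covs = []
for _ in range(C):
    a = rng.normal(size=(n, n))
    covs.append(a @ a.T + n * np.eye(n))
S = scatter_sums(means, covs)
theta = rng.normal(size=(p, n))

value, grad = divergence_and_gradient(theta, covs, S)
fd_grad = numerical_gradient(lambda t: divergence_and_gradient(t, covs, S)[0], theta)
max_gradient_error = np.max(np.abs(grad - fd_grad))

# Gradient ascent: accept a step only if the objective increases, else halve the step
start_value, step = value, 1e-2
for _ in range(50):
    candidate = theta + step * grad
    new_value, new_grad = divergence_and_gradient(candidate, covs, S)
    if new_value > value:
        theta, value, grad = candidate, new_value, new_grad
    else:
        step *= 0.5
```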



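An alternative way of minimizing the Bayes error is to minimize an upper bound on
this quantity. First, the following statement will be proven:

$$\varepsilon \le \sum_{1 \le i < j \le C} \sqrt{\lambda_i\lambda_j} \int_{\mathbb{R}^n} \sqrt{p_i(x)\,p_j(x)}\,dx. \tag{9}$$

Indeed, from Decell et al., supra, the Bayes error can be rewritten as

$$\varepsilon = \int_{\mathbb{R}^n} \sum_{i=1}^{C} \lambda_i p_i(x)\,dx - \int_{\mathbb{R}^n} \max_{1 \le i \le C} \lambda_i p_i(x)\,dx = \int_{\mathbb{R}^n} \min_{1 \le i \le C} \sum_{j \ne i} \lambda_j p_j(x)\,dx \tag{10}$$

and for every $x$, there exists a permutation of the indices $\sigma_x: \{1, \ldots, C\} \to \{1, \ldots, C\}$ such
that the terms $\lambda_1 p_1(x), \ldots, \lambda_C p_C(x)$ are sorted in increasing order, i.e.
$\lambda_{\sigma_x(1)} p_{\sigma_x(1)}(x) \le \ldots \le \lambda_{\sigma_x(C)} p_{\sigma_x(C)}(x)$. Moreover, for $1 \le k \le C - 1$

$$\lambda_{\sigma_x(k)} p_{\sigma_x(k)}(x) \le \sqrt{\lambda_{\sigma_x(k)} p_{\sigma_x(k)}(x)\,\lambda_{\sigma_x(k+1)} p_{\sigma_x(k+1)}(x)} \tag{11}$$

from which it follows that

$$\min_{1 \le i \le C} \sum_{j \ne i} \lambda_j p_j(x) = \sum_{k=1}^{C-1} \lambda_{\sigma_x(k)} p_{\sigma_x(k)}(x) \le \sum_{k=1}^{C-1} \sqrt{\lambda_{\sigma_x(k)} p_{\sigma_x(k)}(x)\,\lambda_{\sigma_x(k+1)} p_{\sigma_x(k+1)}(x)} \le \sum_{1 \le i < j \le C} \sqrt{\lambda_i p_i(x)\,\lambda_j p_j(x)} \tag{12}$$

which, when integrated over $\mathbb{R}^n$, leads to equation (9).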



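As previously, if it is assumed that the $p_i$'s are normal distributions with means $\mu_i$
and covariances $\Sigma_i$, the bound given by the right-hand side of equation (9) has the closed
form expression

$$\sum_{1 \le i < j \le C} \sqrt{\lambda_i\lambda_j}\; e^{-\rho(i,j)}, \tag{13}$$

where

$$\rho(i,j) = \frac{1}{8}(\mu_i - \mu_j)^T \left[\frac{\Sigma_i + \Sigma_j}{2}\right]^{-1} (\mu_i - \mu_j) + \frac{1}{2}\log\frac{\left|\frac{\Sigma_i + \Sigma_j}{2}\right|}{\sqrt{|\Sigma_i|\,|\Sigma_j|}} \tag{14}$$

is called the Bhattacharyya distance between the normal distributions $p_i$ and $p_j$ (see
Fukunaga, supra). Similarly, one can define $\rho_\theta(i,j)$, the Bhattacharyya distance between
the projected densities $p_i^\theta$ and $p_j^\theta$. Combining equations (9) and (13), one obtains the
following inequality (indicated in Fig. 2 at 126) involving the Bayes error rate in the
projected space:

$$\varepsilon_\theta \le \sum_{1 \le i < j \le C} \sqrt{\lambda_i\lambda_j}\; e^{-\rho_\theta(i,j)} \;(= B_\theta). \tag{15}$$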
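
The closed form of equations (13) and (14) can be compared against direct numerical integration in one dimension. The sketch below assumes NumPy and three hypothetical gaussian classes; it checks that $e^{-\rho(i,j)}$ matches the integral of $\sqrt{p_i p_j}$ and that the numerically evaluated Bayes error stays below the union bound. All names and parameter values are illustrative.

```python
import numpy as np

def bhattacharyya_distance(mu_i, cov_i, mu_j, cov_j):
    """Bhattacharyya distance between two gaussians, equation (14)."""
    W = 0.5 * (cov_i + cov_j)
    d = mu_i - mu_j
    quad = 0.125 * d @ np.linalg.solve(W, d)
    logdet_w = np.linalg.slogdet(W)[1]
    logdet_i = np.linalg.slogdet(cov_i)[1]
    logdet_j = np.linalg.slogdet(cov_j)[1]
    return quad + 0.5 * (logdet_w - 0.5 * (logdet_i + logdet_j))

def union_bhattacharyya_bound(priors, means, covs):
    """Closed-form upper bound on the Bayes error, expression (13)."""
    total = 0.0
    for i in range(len(priors)):
        for j in range(i + 1, len(priors)):
            rho = bhattacharyya_distance(means[i], covs[i], means[j], covs[j])
            total += np.sqrt(priors[i] * priors[j]) * np.exp(-rho)
    return total

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Three one-dimensional classes (illustrative values only)
priors = [0.5, 0.3, 0.2]
means = [np.array([0.0]), np.array([2.0]), np.array([-1.5])]
covs = [np.array([[1.0]]), np.array([[4.0]]), np.array([[0.25]])]

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
dens = np.array([gaussian_pdf(x, m[0], c[0, 0]) for m, c in zip(means, covs)])
weighted = np.array(priors)[:, None] * dens

bayes_error = 1.0 - np.sum(weighted.max(axis=0)) * dx       # equation (1) on a grid
bound = union_bhattacharyya_bound(priors, means, covs)       # expression (13)
coeff_01 = np.sum(np.sqrt(dens[0] * dens[1])) * dx           # integral of sqrt(p_0 p_1)
rho_01 = bhattacharyya_distance(means[0], covs[0], means[1], covs[1])
```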



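The following simplifying notations will now be introduced:

$$B_{ij} = \frac{1}{4}(\mu_i - \mu_j)(\mu_i - \mu_j)^T \quad \text{and} \quad W_{ij} = \frac{1}{2}(\Sigma_i + \Sigma_j), \quad 1 \le i < j \le C.$$

From equation (14), it follows that:

$$\rho_\theta(i,j) = \frac{1}{2}\,\mathrm{trace}\left\{(\theta W_{ij}\theta^T)^{-1}\,\theta B_{ij}\theta^T\right\} + \frac{1}{2}\log\frac{|\theta W_{ij}\theta^T|}{\sqrt{|\theta\Sigma_i\theta^T|\,|\theta\Sigma_j\theta^T|}} \tag{16}$$

(indicated in Fig. 2 at 126) and the gradient of $B_\theta$, the right-hand side of equation (15)
(indicated in Fig. 2 at 128), with respect to $\theta$ is

$$\frac{\partial B_\theta}{\partial \theta} = -\sum_{1 \le i < j \le C} \sqrt{\lambda_i\lambda_j}\; e^{-\rho_\theta(i,j)}\, \frac{\partial \rho_\theta(i,j)}{\partial \theta} \tag{17}$$

(indicated in Fig. 2 at 130) with, again by making use of differentiation results from
Searle, supra,

$$\frac{\partial \rho_\theta(i,j)}{\partial \theta} = (\theta W_{ij}\theta^T)^{-1}\left[\theta B_{ij} - \theta B_{ij}\theta^T(\theta W_{ij}\theta^T)^{-1}\,\theta W_{ij}\right] + (\theta W_{ij}\theta^T)^{-1}\,\theta W_{ij} - \frac{1}{2}\left[(\theta\Sigma_i\theta^T)^{-1}\,\theta\Sigma_i + (\theta\Sigma_j\theta^T)^{-1}\,\theta\Sigma_j\right]. \tag{18}$$

The use of equation (18) is indicated in Fig. 2 at 130.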
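
The following sketch evaluates the projected Bhattacharyya distance of equation (16), the bound of equation (15) and the gradients of equations (17) and (18), and compares the analytic gradient with central finite differences. NumPy is assumed, and the function names and random test classes are hypothetical; the sketch is meant as a consistency check of the formulas rather than as the implementation used in the experiments.

```python
import numpy as np

def rho_and_gradient(theta, mu_i, cov_i, mu_j, cov_j):
    """Projected Bhattacharyya distance (16) and its gradient (18)."""
    d = (mu_i - mu_j)[:, None]
    B = 0.25 * d @ d.T
    W = 0.5 * (cov_i + cov_j)
    A_w = theta @ W @ theta.T
    A_i = theta @ cov_i @ theta.T
    A_j = theta @ cov_j @ theta.T
    inv_tB = np.linalg.solve(A_w, theta @ B)     # (theta W theta^T)^{-1} theta B
    inv_tW = np.linalg.solve(A_w, theta @ W)     # (theta W theta^T)^{-1} theta W
    M = inv_tB @ theta.T
    logdet = lambda a: np.linalg.slogdet(a)[1]
    rho = 0.5 * np.trace(M) + 0.5 * (logdet(A_w) - 0.5 * (logdet(A_i) + logdet(A_j)))
    grad = (inv_tB - M @ inv_tW) + inv_tW - 0.5 * (
        np.linalg.solve(A_i, theta @ cov_i) + np.linalg.solve(A_j, theta @ cov_j))
    return rho, grad

def bound_and_gradient(theta, priors, means, covs):
    """Union Bhattacharyya bound (15) and its gradient (17)."""
    value, grad = 0.0, np.zeros_like(theta)
    for i in range(len(priors)):
        for j in range(i + 1, len(priors)):
            rho, g = rho_and_gradient(theta, means[i], covs[i], means[j], covs[j])
            w = np.sqrt(priors[i] * priors[j]) * np.exp(-rho)
            value += w
            grad -= w * g
    return value, grad

def numerical_gradient(f, theta, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        e = np.zeros_like(theta)
        e[idx] = h
        g[idx] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(2)
n, p, C = 5, 2, 3
priors = [0.5, 0.3, 0.2]
means = [rng.normal(size=n) for _ in range(C)]
covs = []
for _ in range(C):
    a = rng.normal(size=(n, n))
    covs.append(a @ a.T + n * np.eye(n))
theta = rng.normal(size=(p, n))

bound, grad = bound_and_gradient(theta, priors, means, covs)
fd_grad = numerical_gradient(
    lambda t: bound_and_gradient(t, priors, means, covs)[0], theta)
max_gradient_error = np.max(np.abs(grad - fd_grad))
```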



In connection with the foregoing discussion, speech recognition experiments were
conducted on a voicemail transcription task (see Padmanabhan et al., "Recent
improvements in voicemail transcription", Proceedings of EUROSPEECH '99, Budapest,
Hungary, 1999). The baseline system had 2.3K context dependent HMM states and
134K diagonal gaussian mixture components and was trained on approximately 70 hours
of data. The test set consisted of 86 messages (approximately 7000 words). The baseline
system used 39-dimensional frames (13 cepstral coefficients plus deltas and double deltas
computed from 9 consecutive frames).

For the divergence and Bhattacharyya projections, every 9 consecutive
24-dimensional cepstral vectors were spliced together forming 216-dimensional feature
vectors which were then clustered to estimate one full covariance gaussian density for
each state. Subsequently, a 39 x 216 transformation $\theta$ was computed using the objective
functions for the divergence (equation [7]) and the Bhattacharyya bound (equation [15]),
which projected the models and feature space down to 39 dimensions.
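
The splicing and projection dimensions described above can be sketched as follows. NumPy is assumed; the handling of utterance edges (repeating the first and last frame) and the random placeholder matrix standing in for a trained $\theta$ are assumptions made for the example and are not stated in the disclosure.

```python
import numpy as np

def splice_frames(cepstra, context=4):
    """Stack each frame with `context` neighbours on either side.

    With context=4 and 24-dimensional input this yields 9 * 24 = 216 dimensions.
    Assumption: edges are padded by repeating the first and last frame.
    """
    num_frames = cepstra.shape[0]
    padded = np.pad(cepstra, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + num_frames] for k in range(2 * context + 1)])

def project(spliced, theta):
    """Apply y = theta x to every spliced frame (rows of `spliced`)."""
    return spliced @ theta.T

rng = np.random.default_rng(0)
cepstra = rng.normal(size=(100, 24))      # 100 frames of 24 cepstral coefficients
spliced = splice_frames(cepstra)          # shape (100, 216)
theta = rng.normal(size=(39, 216))        # placeholder for a trained projection
features = project(spliced, theta)        # shape (100, 39)
```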

As mentioned in Haeb-Umbach et al., supra, it is not clear what the most
appropriate class definition for the projections should be. The best results were obtained
by considering each individual HMM state as a separate class, with the priors of the
gaussians summing up to one across states. Both optimizations were initialized with the
LDA matrix and carried out using a conjugate gradient descent routine with user supplied
analytic gradient from the NAG (Numerical Algorithms Group) Fortran library. (The NAG
Fortran library is a collection of mathematical subroutines, or subprograms, for
performing various scientific/mathematical computations such as: solving systems of linear
or non-linear equations, function integration, differentiation, matrix operations,
eigensystem analysis, constrained or unconstrained function optimization, etc.)

The routine performs an iterative update of the inverse of the hessian of the 
objective function by accumulating curvature information during the optimization. 

Figure 3 illustrates the evolution of objective functions for divergence, while 
Figure 4 illustrates the evolution of objective functions for the Bhattacharyya bound.

The parameters of the baseline system (with 134K gaussians) were then 
re-estimated in the transformed spaces using the EM algorithm. Table 1 summarizes the 
improvements in the word error rates for the different systems. 

TABLE 1

System                      Word Error Rate

Baseline (MFCC+Δ+ΔΔ)        39.61%
LDA                         37.39%
Interclass divergence       36.32%
Bhattacharyya bound         35.73%
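
The relative improvements implied by Table 1 can be computed directly from the listed word error rates. The short sketch below uses only the four figures from the table; the variable names are illustrative.

```python
# Word error rates from Table 1, in percent
baseline = 39.61
word_error_rates = {
    "LDA": 37.39,
    "Interclass divergence": 36.32,
    "Bhattacharyya bound": 35.73,
}

# Relative improvement over the baseline, in percent
relative_improvement = {
    name: 100.0 * (baseline - wer) / baseline
    for name, wer in word_error_rates.items()
}

for name, gain in relative_improvement.items():
    print(f"{name}: {gain:.1f}% relative improvement")
```

The figure for the Bhattacharyya bound is close to 10% relative, while LDA alone yields roughly half of that gain.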



In recapitulation, two methods for performing discriminant feature space 
projections have been presented. Unlike LDA, they both aim to directly minimize the 
probability of misclassification in the projected space by either maximizing the interclass 
divergence and relating it to the Bayes error or by directly minimizing an upper bound on 
the classification error. Both methods lead to defining smooth objective functions which 
have as argument projection matrices and which can be numerically optimized. 
Experimental results on large vocabulary continuous speech recognition over the 
telephone show the superiority of the resulting features over their LDA or cepstral 
counterparts. 

Some primary applications of the methods and arrangements discussed herein
relate to pattern recognition, including speech recognition. Other examples of pattern 
recognition, which may make use of the embodiments of the present invention, include but 
are not limited to: handwriting and optical character recognition (OCR), speaker 
identification and verification, signature verification (for security applications), object 
recognition and scene analysis (such as aircraft identification based on aerial photographs), 
crops monitoring, submarine identification based on acoustic signature, and several others. 

It is to be understood that the present invention, in accordance with at least one 
presently preferred embodiment, includes an input interface for inputting a pattern and a 
transformer for transforming the input pattern to provide a set of at least one feature for a 
classifier. Together, the input interface and transformer may be implemented on at least 
one general-purpose computer running suitable software programs. These may also be 
implemented on at least one Integrated Circuit or part of at least one Integrated Circuit. 
Thus, it is to be understood that the invention may be implemented in hardware, software, 
or a combination of both. 

If not otherwise stated herein, it is to be assumed that all patents, patent
applications, patent publications and other publications (including web-based publications)
mentioned and cited herein are hereby fully incorporated by reference herein as if set forth
in their entirety herein.

Although illustrative embodiments of the present invention have been described 
herein with reference to the accompanying drawings, it is to be understood that the 
invention is not limited to those precise embodiments, and that various other changes and 
modifications may be effected therein by one skilled in the art without departing from the
scope or spirit of the invention. 

Claims

What is claimed is: 

1. A method of providing pattern recognition, said method comprising the steps
of:

inputting a pattern;

transforming the input pattern to provide a set of at least one feature for a
classifier;

said transforming step comprising the step of minimizing the probability of 
subsequent misclassification of the at least one feature in the classifier; 

said minimizing step comprising: 

developing an objective function; and 

optimizing the objective function through gradient descent. 

2. The method according to Claim 1, wherein said minimizing step comprises
maximizing an average pairwise divergence.

3. The method according to Claim 1, wherein said minimizing step comprises
minimizing a union Bhattacharyya bound.

4. The method according to Claim 1, further comprising the step of querying 
whether the optimized objective function converges. 

5. The method according to Claim 4, further comprising the step of repeating said 
optimizing step if the optimized objective function does not converge. 

6. The method according to Claim 1, wherein said pattern recognition is speech 
recognition. 

7. Apparatus for providing pattern recognition, said apparatus comprising: 
an input interface for inputting a pattern; 

a transformer for transforming the input pattern to provide a set of at least one 
feature for a classifier; 

said transformer being adapted to minimize the probability of subsequent 
misclassification of the at least one feature in the classifier; 

said transformer further being adapted to: 
develop an objective function; and

optimize the objective function through gradient descent. 

8. The apparatus according to Claim 7, wherein said transformer is adapted to 
minimize the probability of subsequent misclassification of the at least one feature in the 
classifier via maximizing an average pairwise divergence. 

9. The apparatus according to Claim 7, wherein said transformer is adapted to 
minimize the probability of subsequent misclassification of the at least one feature in the 
classifier via minimizing a union Bhattacharyya bound. 

10. The apparatus according to Claim 7, wherein said transformer is further 
adapted to query whether the optimized objective function converges. 

11. The apparatus according to Claim 10, wherein said transformer is further 
adapted to repeat optimization of the objective function if the optimized objective function 
does not converge. 

12. The apparatus according to Claim 7, wherein said pattern recognition is speech 
recognition. 

13. A program storage device readable by machine, tangibly embodying a program
of instructions executable by the machine to perform method steps for providing pattern
recognition, said method comprising the steps of:

inputting a pattern;

transforming the input pattern to provide a set of at least one feature for a
classifier;

said transforming step comprising the step of minimizing the probability of
subsequent misclassification of the at least one feature in the classifier;

said minimizing step comprising:

developing an objective function; and

optimizing the objective function through gradient descent.




Abstract of the Disclosure 

In connection with speech recognition, the design of a linear transformation
$\theta \in \mathbb{R}^{p \times n}$, of rank $p \le n$, which projects the features of a classifier $x \in \mathbb{R}^n$ onto
$y = \theta x \in \mathbb{R}^p$ such as to achieve minimum Bayes error (or probability of misclassification).
Two avenues are explored: the first is to maximize the $\theta$-average divergence between the
class densities and the second is to minimize the union Bhattacharyya bound in the range
of $\theta$. While both approaches yield similar performance in practice, they outperform
standard linear discriminant analysis features and show a 10% relative improvement in the
word error rate over known cepstral features on a large vocabulary telephony speech
recognition task.